How to introspect the Freebase schema with SPARQL
... as well as other RDF databases
Prelude
This post is a followup to How to write SPARQL queries against Freebase data and is part of a series. Subscribe to the RSS feed and to the :BaseKB mailing list for future episodes.
The data set I'm using is the 2014-03-02 edition of :BaseKB Gold. You can download this via BitTorrent and load it into any standards-compliant triple store, but it's even faster to use the pre-loaded Compact Edition, which can deploy perfectly matched hardware, software and data in just one click.
The need for schema inspection
It's hard to imagine a data model simpler than RDF, which is based on two concepts, the Node and the Triple. What is hard is getting a handle on a database that has 800 million facts and 60,000 properties. If you don't know which properties to use, the data you want can be hidden in plain sight. Thus, one of the big problems in using this kind of database is understanding its contents.
Opening the T-Box
Because RDF schemas are expressed in RDF, you can use ordinary SPARQL queries to ask questions about schemas. In our last episode, we used the :geography.river.length property, so let's take a look at its schema information with this SPARQL query, which displays all triples that have :geography.river.length as a subject.
prefix : <http://rdf.basekb.com/ns/>
select ?p ?o {
:geography.river.length ?p ?o .
}
this gives the following result
Note that we get a mix of RDFS and OWL vocabulary together with some Freebase-specific vocabulary. Instead of the standard xsd:float, the data type of the object (rdfs:range) is recorded as :type.float. The Freebase property has a :type.property.expected_type that corresponds to rdfs:range and a :type.property.schema that corresponds to rdfs:domain.
Note that the triple store I'm running has RDFS inference turned off, so we are seeing facts asserted in Freebase, not facts that could be inferred via RDFS or OWL.
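The correspondence described above can be checked directly. The following is a minimal sketch that asks only for the two Freebase-specific predicates. It uses only names that appear in this post; the variable names ?expectedType and ?schema are illustrative, and whether the values come back as friendly names or as mids depends on how the dump records them.

```sparql
prefix : <http://rdf.basekb.com/ns/>
# Freebase analogues of rdfs:range and rdfs:domain for a single property
select ?expectedType ?schema {
  :geography.river.length :type.property.expected_type ?expectedType .
  :geography.river.length :type.property.schema ?schema .
}
```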
Knowing that Freebase properties typically have a :domain.type.property structure, you'd imagine that Freebase encodes schema data under the :type domain, and you'd be right. Look at https://www.freebase.com/type?schema= and you'll find a number of types that represent metainformation about Freebase. As you'll see later, you can also find schema information in other Freebase domains.
Let's take a look at the schema for :type.type, the Freebase version of rdfs:Class, which we can see at https://www.freebase.com/type/type?schema=
Facts with the :type.type.instance predicate have been deleted from :BaseKB because they are bulky (there is one for each a, that is rdf:type, statement in the system) and also because you end up with some subjects having millions of :type.type.instance facts, which blows out memory if you try to bring all facts with a given subject together in one place.
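Since :type.type.instance is gone, the practical way to enumerate the instances of a type is to go in the other direction, through rdf:type (abbreviated a in SPARQL). A small sketch, assuming the type assertions are present in the same form used by the author query later in this post; the limit of 10 is arbitrary:

```sparql
prefix : <http://rdf.basekb.com/ns/>
# List a few instances of a type via rdf:type rather than :type.type.instance
select ?river {
  ?river a :geography.river .
} limit 10
```

Keeping a limit on a query like this is prudent, since popular types can have a very large number of instances.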
Double Vision
If we take a look at facts concerning :geography.river
prefix : <http://rdf.basekb.com/ns/>
select ?p ?o (lang(?o) AS ?lang) {
:geography.river ?p ?o .
}
we see we have labels in many languages
but many of the facts that we'd expect to have in the schema (such as those that apply to :type.type) are missing. It turns out that many of these are registered under the mid identifier for :geography.river, which we can find by doing a key lookup
prefix : <http://rdf.basekb.com/ns/>
select ?riverMid {
?riverMid :type.object.key "/geography/river"
}
We get
Next I'll look up the facts with :m.01xs05k as the subject, excluding the labels (which are the same as the other labels) so the results fit in a screenshot
prefix : <http://rdf.basekb.com/ns/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?p ?o {
:m.01xs05k ?p ?o .
FILTER(?p != rdfs:label)
}
Looking at these facts we find all kinds of cool stuff, including summary data.
It's now straightforward to write SPARQL to answer questions about the schema. For instance, we can get a list of properties
prefix : <http://rdf.basekb.com/ns/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?propId ?propLabel {
:m.01xs05k :type.type.properties ?prop .
?prop :type.object.id ?propId .
?prop rdfs:label ?propLabel .
FILTER(LANG(?propLabel)='en')
}
with the following results:
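A variation on the property listing is to ask for each property's expected type as well, which gives a rough picture of the type's shape. This is only a sketch: it assumes that the nodes returned by :type.type.properties also carry :type.property.expected_type facts, which may not hold for every property given the "double vision" between names and mids described above.

```sparql
prefix : <http://rdf.basekb.com/ns/>
# Properties of the river type, each paired with its expected (range) type
select ?propId ?expectedType {
  :m.01xs05k :type.type.properties ?prop .
  ?prop :type.object.id ?propId .
  ?prop :type.property.expected_type ?expectedType .
}
```

If this returns fewer rows than the plain property listing, wrapping the last pattern in optional { } would show which properties lack the fact.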
Just to make clear what is going on, you'll never actually see :m.01xs05k in the predicate field
prefix : <http://rdf.basekb.com/ns/>
select (count(*) as ?cnt) {
?s :m.01xs05k ?o .
}
The weirdness here is because of the nature of Freebase.
The "real" identifiers in Freebase are the mid identifiers, which are more-or-less sequential integers. graphd, the internal Freebase database, resolves names like '/geography/river' to mids when processing queries, and then converts names back to 'human friendly' form for display.
A major difficulty with processing the old Freebase quad dump was that the quad dump did not use consistent identifiers in the various fields, which meant that it was not possible to do any processing that joined the schema with the data. The original :BaseKB fixed this problem by resolving all identifiers to mids, but this meant that queries looked like this:
prefix : <http://rdf.basekb.com/>
select ?river ?length {
?river :m.01xs0f4 ?length .
?river :m.014h :m.06bnz .
} ORDER BY DESC(?length) LIMIT 1
Writing queries like this is a bit like coding in assembly language; the superficial difficulties can be fixed by rewriting queries to imitate graphd's name resolution behavior, but when Freebase switched to an official RDF dump, they committed to using consistent identifiers for predicates and :BaseKB followed.
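For contrast, here is a sketch of a query of the same shape written with consistent, human-readable predicate names. It is not a translation of the mid-based query above (no claim is made here about what those mids stand for); it simply asks for the river with the greatest :geography.river.length, using only names that appear in this post.

```sparql
prefix : <http://rdf.basekb.com/ns/>
# Same join-and-sort shape as the mid-only query, with readable predicates
select ?river ?length {
  ?river :geography.river.length ?length .
  ?river a :geography.river .
} order by desc(?length) limit 1
```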
Included Types
Another funny thing about Freebase is the concept of an 'Included Type', which is similar, but not quite identical, to the RDFS concept of a subclass (rdfs:subClassOf). If we turtleize the relevant properties of the :geography.river mid, we get
:m.01xs05k
:freebase.type_profile.strict_included_types
:m.01n7 , :m.02h5yxm ;
:freebase.type_hints.included_types
:m.01n7 , :m.01y2jks , :m.01c5 , :m.02h5yxm .
The included_types property has been around since Jan 2007, strict_included_types is newer, created in Feb 2013.
The original included_types came out of the requirements for a community-edited database. For instance, ':people.person' is an included type of ':book.author' because the author of a book is usually a person. This means that when somebody adds an author to a book, Freebase automatically assumes that this is a person. Although it's not factually true that authors are always people, it's true enough that we get better results by assuming this than by expecting users to tag authors as persons manually.
(If we believe Freebase, there are 4360 authors who are not people, out of 533,452. The query below counts the un-people.)
prefix : <http://rdf.basekb.com/ns/>
select (count(*) as ?cnt) {
?author a :book.author .
minus {
?author a :people.person .
}
}
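The same count can also be written with FILTER NOT EXISTS, which some readers find easier to follow than MINUS. For this pattern the two forms should agree, because the inner pattern shares the variable ?author with the outer one; treat this as an alternative sketch rather than the query used to produce the figures above.

```sparql
prefix : <http://rdf.basekb.com/ns/>
# Authors that carry no :people.person type assertion
select (count(*) as ?cnt) {
  ?author a :book.author .
  filter not exists {
    ?author a :people.person .
  }
}
```

Dropping the filter block gives the total number of authors, the denominator in the ratio quoted above.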
In the case of river, we can look up the included types like so,
prefix : <http://rdf.basekb.com/ns/>
select ?that ?id {
?that :type.object.id ?id .
:m.01xs05k :freebase.type_hints.included_types ?that .
}
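Because both properties hang off the same mid, it is also possible to ask which included types are hints only, that is, listed under included_types but not under strict_included_types. A minimal sketch using only the predicates shown in the Turtle above:

```sparql
prefix : <http://rdf.basekb.com/ns/>
# Included types of river that are not also strict included types
select ?that ?id {
  :m.01xs05k :freebase.type_hints.included_types ?that .
  ?that :type.object.id ?id .
  filter not exists {
    :m.01xs05k :freebase.type_profile.strict_included_types ?that .
  }
}
```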
Next steps
Since schema information can be expressed in RDF, RDF schemas can be explored using SPARQL.
Although Freebase uses some standard vocabulary, most schema information is expressed with non-standard vocabulary. This makes sense, since the Freebase schema supports a collaborative editing interface rather than RDFS inference.
It takes just a little knowledge, outlined in this article and documented in Freebase, to ask questions about the Freebase schema in SPARQL. This knowledge can be the basis for RDF-based Freebase browsing interfaces (to be discussed in a future post), conversion to RDFS/OWL schemas that can be used with tools like Protégé, as well as hand-written SPARQL queries.
This post is part of a series: future posts will cover compound value types, how to look up identifiers, and other topics. Subscribe to our RSS feed and the :BaseKB mailing list.